Which Grocery Store…Denver?

Motivation

Imagine: I’m in the group chat. My friend Arielle is psyching herself up to go get groceries, and upon a query, writes, “it’s the worst. in part because my grocery store is the worst. I think it would be less stressful if I had a better store. it’s understocked, so poorly staffed, checkout takes ages and either you have to bag your own groceries or you have to wait forever in line just to do self checkout but they still don’t turn on more of the self check out kiosks for some reason. outside lots of unsavory types loiter and it’s uncomfortable, everything smells like piss.”

Then Laura writes: “Andrew you should map all the grocery stores by driving distance and reviews to determine the best option.”

Say no more, fam.

Initial Code

So to get started, we need to think about how we're going to access data on all the grocery stores near my friend's house. I've previously done an analysis like this for the group chat, when Arielle's sister was moving to Salt Lake City and needed to find housing; there turned out to be a Craigslist API then. And the time I needed to pull song lyrics for a Twitter bot: Genius.com has an API too. Surely Google Maps will let me access their reviews for all the nearby stores?

I do a little googling and I find that there's a package–new to me at least–called googleway. It covers a host of map-related Google APIs that I didn't realize existed. Do I need to generate an API key? I go to their website and I'm stunned to find I have a Google Maps API key from another project, 2.5 years old and still live and ready to go.

Initial fumbling gives me some locked results, and I realize that for each individual service the googleway package (really, the Google Maps API underneath) exposes, I need to go to my API key settings and enable that service.

The first two commands are for the googleway package. My key is saved in a not-included setup chunk for privacy. The third line registers with the ggmap package–my old standby1.

set_key(key = key)
google_keys()
register_google(key = key)

Having logged in and connected, we need to run our first data pull.

res <- google_places(location = arielle,
                     keyword = "Grocery Store",
                     radius = 5000,
                     key = key)

This calls the arielle object I made earlier, a vector of two elements holding her latitude and longitude in Denver.
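For reference, arielle itself lives in the hidden setup chunk. A minimal sketch, with made-up coordinates since the real address stays private, would be:

arielle <- c(39.74, -104.98) # c(latitude, longitude), the order googleway expects

Any coordinate pair works; the point is to store it once and reuse it everywhere.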

I expected the returned object to be a data frame, so you can imagine my befuddlement the first few times I ran this and got a URL in the console. The first few URLs I plugged in threw the access-denied error I mentioned above; once I enabled the service, it ran properly. The next bit of code calls that URL and converts its content from JSON into a tidy data frame I can use.

So at this point, in code I can't show because it would expose my key, I have something to the effect of:

groceries <- jsonlite::fromJSON("https://url generated by last step goes here")

and I use the results of that to generate a tibble calling on the following nested lists:

# cbind() coerces everything to a character matrix, which is why the
# rating columns come back as character and need as.numeric() later.
groceries_tib <- cbind(groceries$result$name,
                       groceries$result$geometry$location$lat,
                       groceries$result$geometry$location$lng,
                       groceries$result$rating,
                       groceries$result$user_ratings_total) %>%
  as_tibble()

gro1 <- groceries_tib %>%
  rename(name = V1,
         latitude = V2,
         longitude = V3,
         rating = V4,
         n_ratings = V5) %>%
  arrange(desc(rating))

But because results come back 20 at a time and I had already decided to pull the first 40 grocery stores, I have to run this twice, producing a gro2 to match the gro1 I just created. The second run is a bit more involved because we are requesting the “next page” of results from Google.

res_next <- google_places(location = arielle,
                          keyword = "Grocery Store",
                          page_token = groceries$next_page_token,
                          radius = 5000,
                          key = key)

This should be the last code I have to hide for a while; I did the same URL -> JSON business as in the first section, then clean it up:

groceries_2_tib <- cbind(groceries_2$result$name,
                         groceries_2$result$geometry$location$lat,
                         groceries_2$result$geometry$location$lng,
                         groceries_2$result$rating,
                         groceries_2$result$user_ratings_total) %>%
  as_tibble()

gro2 <- groceries_2_tib %>%
  rename(name = V1,
         latitude = V2,
         longitude = V3,
         rating = V4,
         n_ratings = V5) %>%
  arrange(desc(rating))

And now we can join them together into one proper data frame and examine what we have.

gro <- dplyr::bind_rows(gro1, gro2)


gro %>%
  arrange(desc(rating)) %>%
  head(10) %>%
  gt::gt() %>%
  gt::tab_header(
    title = "Top Ten Rated Grocery Stores in Denver"
  )
Top Ten Rated Grocery Stores in Denver
name latitude longitude rating n_ratings
Ready Foods Inc 39.7401931 -105.0208209 5 1
Sun Market 39.7496388 -104.9711721 4.9 21
Town & Country Market Produce 39.7399005 -104.9371343 4.9 17
Pete's Fruits & Vegetables 39.7129002 -104.9220065 4.7 164
Denver Market 39.7715998 -105.006592 4.6 27
Trader Joe's 39.7268649 -104.9827877 4.6 2295
Marczyk Fine Foods 39.7429609 -104.9780866 4.6 771
12th Ave Market 39.7349414 -104.9558239 4.5 62
Trader Joe's 39.7287708 -104.9403211 4.5 3497
Leevers Locavore 39.7689136 -105.019854 4.5 411

This tells us some useful information: first of all, the highest-rated grocery store has only one review, so it's dubious, and we need to keep review counts in mind from now on. It also puts two Trader Joe's in the top ten, which strikes me as appropriate.

What about the bottom ten?

gro %>%
  arrange(desc(rating)) %>%
  tail(10) %>%
  gt::gt() %>%
  gt::tab_header(
    title = "Ten Worst Grocery Stores in Denver"
  )
Ten Worst Grocery Stores in Denver
name latitude longitude rating n_ratings
King Soopers 39.7310976 -104.9735003 4 1364
Safeway 39.755634 -105.0242664 3.9 1637
Argonant Groceries 39.7402164 -104.9838854 3.9 11
King Soopers 39.7374907 -104.9176658 3.9 485
King Soopers 39.737521 -104.997947 3.8 1159
Town Grocery 39.7574123 -104.9736486 3.7 3
Rio Grande Grocery Store 39.7290492 -105.0022845 3.7 7
Safeway 39.7486925 -104.9775013 3.6 2352
Denver Beet Box 39.7876684 -104.9759828 3 1
Decatur Fresh 39.7327007 -105.0218541 0 0

Ah. So the Safeway in question (known to my friend as the “Unsafeway”) is actually the lowest-rated grocery store in Denver (at least among these first 40 hits) that has more than one rating. So we know we're dealing with some real garbage here.
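Before moving on: one way to guard against those single-review flukes (a sketch, not something from the original analysis, and the cutoff of 10 is arbitrary) is to demand a minimum number of ratings before trusting the average:

# hypothetical cutoff of 10 reviews; columns are still character at this point
gro_trusted <- gro %>%
  dplyr::filter(as.numeric(n_ratings) >= 10) %>%
  dplyr::arrange(dplyr::desc(as.numeric(rating)))

The right threshold depends on how paranoid you are about small samples.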

Distance

Haversine

In my original version of this, I just wanted to knock off, go eat dinner, and produce a result ASAP, so I fell back on an old standby: computing the haversine distance from each grocery store to my friend's coordinates. That was simple, because I already had code to that effect sitting around from a project in graduate school.

# degrees to radians
deg2rad <- function(d){
  return(d * (pi/180))
}

# calculate haversine distance between two coordinates.
haversine_km <- function(lat1, lon1, lat2, lon2){
  Rad <- 6371 # radius of Earth in km
  dLat <- deg2rad(lat2 - lat1)
  dLon <- deg2rad(lon2 - lon1)
  a <- sin(dLat/2)^2 + cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * sin(dLon/2)^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  d <- Rad * c # distance in km
  return(d)
}
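As a quick sanity check on the function, here are two downtown Denver landmarks (not points from the data set):

# Union Station to the Colorado State Capitol -- roughly 2 km as the crow flies
haversine_km(39.7527, -105.0003, 39.7393, -104.9848)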

Then it was a simple matter of mutating a new column with that function… except it was at this point that I first noticed that my variables were all character and would need as.numeric() before any kind of calculation. Whoops.

gro <- gro %>%
  mutate(latitude = as.numeric(latitude),
         longitude = as.numeric(longitude),
         rating = as.numeric(rating),
         n_ratings = as.numeric(n_ratings))

gro <- gro %>%
  mutate(haversine = haversine_km(latitude, longitude, arielle[1], arielle[2]))

Having my friend's coordinates stored as a variable ended up being far handier than I anticipated, and I think there's a lesson there. We talk about creating variables up top and never hard-coding numbers when avoidable, but the lesson runs the other way too: make a variable to save yourself copying and pasting numbers, even if they'll never change over the course of your coding.

Google Route Times

At this point, I had already made a version of Visualization One down below, but I challenged myself to re-do it with travel time. Everyone knows that a closer location can sometimes take longer to reach because of highways, traffic, and things like that. This is especially true in American cities, where large highways often cut adjacent neighborhoods off from one another. So let's dig back into that Google API and see what it has for us.

First, we do two data pulls; the routes are generated as you run the calls. Each call calculates the 20 routes from one of the original objects (gro1 and gro2) to the arielle object, producing drive1 and drive2 respectively.

# split in 2 due to rate limit of 25
drive1 <- google_distance(origins = gro1[1:20, c("latitude", "longitude")],
                      destinations = arielle,
                      key = key)
drive2 <- google_distance(origins = gro2[1:20, c("latitude", "longitude")],
                          destinations = arielle,
                          key = key)

Once these are generated, I use the dirty old trick shown here to bolt the nested data frames onto the gro objects from before:

# apply the nested list to the original datasets from above
gro1$drive<-drive1$rows$elements
gro2$drive<-drive2$rows$elements

This updated both data frames so that the final column is actually a list-column, each element of which is a one-row data frame. See:

gro2 %>% 
  dplyr::select(name, rating, drive)
## # A tibble: 20 x 3
##    name                          rating drive       
##    <chr>                         <chr>  <list>      
##  1 Sun Market                    4.9    <df [1 x 3]>
##  2 Town & Country Market Produce 4.9    <df [1 x 3]>
##  3 Pete's Fruits & Vegetables    4.7    <df [1 x 3]>
##  4 Trader Joe's                  4.6    <df [1 x 3]>
##  5 Marczyk Fine Foods            4.6    <df [1 x 3]>
##  6 Trader Joe's                  4.5    <df [1 x 3]>
##  7 Leevers Locavore              4.5    <df [1 x 3]>
##  8 Natural Grocers               4.4    <df [1 x 3]>
##  9 Swansea Corner Store          4.4    <df [1 x 3]>
## 10 Park Hill Supermarket         4.4    <df [1 x 3]>
## 11 Holly Food Market             4.3    <df [1 x 3]>
## 12 Villa Park Mini Mart          4.3    <df [1 x 3]>
## 13 Target Grocery                4.3    <df [1 x 3]>
## 14 Whole Foods Market            4.2    <df [1 x 3]>
## 15 Gem Store                     4.2    <df [1 x 3]>
## 16 Natural Grocers               4.2    <df [1 x 3]>
## 17 Marczyk Fine Foods            4.2    <df [1 x 3]>
## 18 Rio Grande Grocery Store      3.7    <df [1 x 3]>
## 19 Denver Beet Box               3      <df [1 x 3]>
## 20 Decatur Fresh                 0      <df [1 x 3]>

Well! That's a problem, but one we can deal with. We're going to make the adjustment for each set (gro1 and gro2) and then re-bind them. Why not bind first and fix second? Honestly, I was getting a funny error for dataset 2 but not dataset 1, wanted to fix that first, and then never went back to refactor the code.

# unnest again, excluding distance here because we want drive time and the name overlap borks the process. 
# we can actually get both if we want by also unnesting distance but that's neither here nor there. 
gro1 <- gro1 %>%
  dplyr::select(name, latitude, longitude, rating, n_ratings, drive) %>%
  unnest_wider(drive) %>%
  unnest_wider(duration) %>%
  mutate(minutes = value/60)

# fix the character columns
gro1 <- gro1 %>%
  mutate(latitude = as.numeric(latitude),
         longitude = as.numeric(longitude),
         rating = as.numeric(rating),
         n_ratings = as.numeric(n_ratings)
  )
# do all in one step. 
gro2 <- gro2 %>%
  dplyr::select(name, latitude, longitude, rating, n_ratings, drive) %>%
  unnest_wider(drive) %>%
  unnest_wider(duration) %>%
  # assuming value = seconds (it works out to rounding)
  mutate(minutes = value/60) %>%
  mutate(latitude = as.numeric(latitude),
         longitude = as.numeric(longitude),
         rating = as.numeric(rating),
         n_ratings = as.numeric(n_ratings)
  )

# bind and arrange. 
gro <- dplyr::bind_rows(gro1, gro2)
gro <- gro %>%
  arrange(desc(rating)) %>%
  dplyr::select(-status)

# pro-tip: always prefix select() with dplyr:: because it often conflicts with other packages.

A few notes: I immediately assumed that the value variable under “duration” was in seconds, and was pleasantly rewarded when dividing that column by sixty to get a minutes column turned out to be right.

Second, why did I unnest the duration list but not the distance list? Because I already had distance covered by the prior analysis and didn't care about Google's opinion on that; at this point I care about how long it takes my friend to drive there.
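If you did want both, tidyr's names_sep argument keeps the overlapping text/value column names from colliding. A sketch of what that would have looked like back when the drive list-column was still attached (not what I actually ran):

gro1_both <- gro1 %>%
  dplyr::select(name, latitude, longitude, rating, n_ratings, drive) %>%
  tidyr::unnest_wider(drive) %>%
  tidyr::unnest_wider(duration, names_sep = "_") %>%  # duration_text, duration_value
  tidyr::unnest_wider(distance, names_sep = "_")      # distance_text, distance_value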

On to analysis!

Visualization One

First, I generate a second data frame holding the top ten grocery stores. I'm going to label them on the graph; labels for the crappy ones are irrelevant (barring one in particular: the Unsafeway).

gro_top_n <- gro %>% 
  arrange(desc(rating)) %>% 
  head(10)

Then plug it all together to get the graph:

gro %>%
  filter(rating > 3.45) %>%
  ggplot(aes(x = minutes, y = rating, size = n_ratings, color = rating)) +
  geom_text_repel(data = gro_top_n,
                  mapping = aes(x = minutes, y = rating, label = name),
                  inherit.aes = FALSE,
                  #family = "Sofia",
                  min.segment.length = 0) +
  geom_point() +
  labs(x = "Minutes Drive Time (Google Maps)",
       y = "Average Rating",
       title = "Which Grocery Store Should Arielle Go to?",
       size = "Number of Ratings") +
  annotate("text", x = 5, y = 3.5, label = "The Unsafe-way",
           #family = "Sofia"
           ) +
  scale_color_viridis_c(direction = -1) +
  # annotate() draws the arrow once, rather than once per row of the data
  annotate("curve", x = 5, y = 3.5, xend = 1.72, yend = 3.6,
           arrow = arrow(), size = 0.2, color = "black") +
  scale_y_continuous(limits = c(3.45, 5)) +
  theme_light() +
  #theme(text = ggplot2::element_text(family = "Sofia")) +
  NULL

Note that I filtered out any grocery stores rated below the Unsafeway (the two with 3.0 and 0.0 respectively); I don't care about them, and they were such outliers that they made the rest of the graph look warped.

This was great and I was really satisfied. Then my early-career mentor chimed in on social media to ask: why not make a Voronoi diagram?

Visualization Two

The first step here is to get a map covering the area we're considering. I used the regular get_map function, plotted all the grocery stores, and resized as needed until I had a sense of the coordinates I wanted; those became the boundingbox variable required by the get_stamenmap function. Stamen's maps are more artsy and, importantly, less text-laden than the Google map results, which mattered because I was going to be labeling the grocery stores and needed something cleaner.

boundingbox <- c(left = -105.025, bottom = 39.71, right = -104.925, top = 39.79)
denver3 <- get_stamenmap(boundingbox, zoom = 14, maptype = "toner-lite")

denver3 %>% ggmap()

Frustratingly, to make the Voronoi layer run all the way to the outer edge of the map, you have to define a separate bounding box for it, which, annoyingly, has a different parameter structure as well: you give the coordinates, in order, for each individual corner point. The get_stamenmap bounding box just took the farthest longitude or latitude in each direction, which was much easier. But ggvoronoi doesn't presume our outline is a rectangle, so we have to give each point.

outline.df <- data.frame(x = c(-105.025, -104.925, -104.925, -105.025),
                         y = c(39.71, 39.71, 39.79, 39.79))

gro_filtered <- gro %>%
  filter(rating > 3.5)
ggmap(denver3,
      # base layer allows the same kind of basics that go in the original ggplot call without a map. 
      base_layer = ggplot(data=gro_filtered, aes(x = longitude,y=latitude, label = name)) 
      ) +
  geom_point()+
  ggvoronoi::geom_voronoi(
    aes(fill = rating),
    color = "black", 
    alpha = 0.5, 
    outline = outline.df) +
  geom_text_repel( 
    #family = "Sofia",
                  min.segment.length = 0) +
  annotate("point", x = arielle[2], y = arielle[1], color = "red") +
  scale_fill_viridis_c(direction = -1) + 
  theme_light() + 
  #theme(text = ggplot2::element_text(family = "Sofia")) +
  labs(x = "Longitude",
       y = "Latitude", 
       title = "Which Grocery Store Should Arielle Go to?") + 
  NULL

This gets me to the real platonic ideal of this analysis: no matter where you are in the city, you can see where your closest grocery store is and what its rating is. And we can see that Arielle's red dot really does sit in the worst possible cell, that she is relatively close to a bodega with higher ratings, and that some good Trader Joe's locations are a ten-minute drive away. No need to ever go to the Unsafeway again.


  1. I once made a Shiny app that called down the Google Maps API any time a user toggled anything. I was running several hundred requests a session just to look at different maps of Africa. I'm reasonably certain I'm personally responsible for the API requiring an account and a key now↩︎